fix: avx512vl masked load/store by DiamonDinoia · Pull Request #1353 · xtensor-stack/xsimd

DiamonDinoia · 2026-05-21T07:24:01Z

xsimd::batch<uint32_t, avx512vl_256>::store(ptr, constexpr_mask, mode) used
to compile to a 6-instruction per-lane scalar extract loop instead of a single
EVEX-encoded masked store, because the call site in xsimd_batch.hpp:766
over-specified template arguments and SFINAE'd away every viable masked-store
overload except the scalar common fallback.

The load side was unaffected — its call site (:743) was already correct.

#include <xsimd/xsimd.hpp>

using A    = xsimd::avx512vl_256;
using Bu32 = xsimd::batch<uint32_t, A>;

// Constant alternating mask: lanes 0,2,4,6 active.
struct alt {
    static constexpr bool get(std::size_t i, std::size_t) { return (i & 1) == 0; }
};

static constexpr auto mask = xsimd::make_batch_bool_constant<uint32_t, alt, A>();

Bu32 load_u32(uint32_t const* p) {
    return Bu32::load(p, mask, xsimd::unaligned_mode{});
}

void store_u32(uint32_t* p, Bu32 v) {
    v.store(p, mask, xsimd::unaligned_mode{});
}

g++ -O3 -S -masm=intel -std=c++14 -march=skylake-avx512 -DXSIMD_DEFAULT_ARCH=xsimd::avx512vl_256

Codegen — before (master, commit `7d30b9cc`)

_Z8load_u32PKj:                  # load_u32
    mov         eax, 85
    kmovb       k1, eax
    vmovdqu32   ymm0{k1}{z}, YMMWORD PTR [rdi]
    ret

_Z9store_u32PjN5xsimd5batchIjNS0_12avx512vl_256EEE:   # store_u32
    valignd       ymm1, ymm0, ymm0, 6
    vmovd         DWORD PTR  [rdi], xmm0
    vpextrd       DWORD PTR 8[rdi], xmm0, 2
    vextracti32x4 xmm2, ymm0, 1
    vmovd         DWORD PTR 24[rdi], xmm1
    vmovd         DWORD PTR 16[rdi], xmm2
    ret

Load is fine. Store is the scalar fallback: GCC unrolled the 8-lane loop in
xsimd_common_memory.hpp:377 into four per-lane 32-bit stores, materialising
each active lane individually, with no mask instruction involved.
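
For reference, the per-lane behaviour described above can be sketched in plain C++. This is not the code at xsimd_common_memory.hpp:377; scalar_masked_store and its std::array parameters are hypothetical stand-ins chosen for illustration. With the alternating mask it writes lanes 0, 2, 4 and 6, which corresponds to the byte offsets 0, 8, 16 and 24 seen in the listing above.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a generic masked-store fallback:
// copy lane i to dst[i] only when mask[i] is true and leave
// every other destination slot untouched.
template <std::size_t N>
void scalar_masked_store(std::uint32_t* dst,
                         const std::array<std::uint32_t, N>& lanes,
                         const std::array<bool, N>& mask)
{
    for (std::size_t i = 0; i < N; ++i)
    {
        if (mask[i])
        {
            dst[i] = lanes[i];
        }
    }
}
```

Once the mask is a compile-time constant, a loop of this shape can be fully unrolled, leaving only the unconditional stores for the active lanes.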

Codegen — after

_Z8load_u32PKj:                  # load_u32 — unchanged
    mov         eax, 85
    kmovb       k1, eax
    vmovdqu32   ymm0{k1}{z}, YMMWORD PTR [rdi]
    ret

_Z9store_u32PjN5xsimd5batchIjNS0_12avx512vl_256EEE:   # store_u32
    mov         eax, 85
    kmovb       k1, eax
    vmovdqu32   YMMWORD PTR [rdi]{k1}, ymm0
    ret

One masked instruction.
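
The immediate 85 loaded into k1 in both listings is the alternating mask folded into a bitmask, with bit i standing for lane i. The sketch below reproduces that arithmetic at compile time; mask_bits is a hypothetical helper written for this note, not an xsimd function, and the alt generator is copied from the reproducer above.

```cpp
#include <cstddef>
#include <cstdint>

// Same generator as in the reproducer: even lanes are active.
struct alt
{
    static constexpr bool get(std::size_t i, std::size_t) { return (i & 1) == 0; }
};

// Hypothetical helper (not part of xsimd): fold a generator into an
// integer whose bit i is set when lane i is active. Written as a
// single-return recursion so it is also valid C++11/14 constexpr.
template <class G>
constexpr std::uint32_t mask_bits(std::size_t n, std::size_t i = 0)
{
    return i == n ? 0u
                  : ((G::get(i, n) ? (1u << i) : 0u) | mask_bits<G>(n, i + 1));
}

// Lanes 0, 2, 4, 6 -> 1 + 4 + 16 + 64 = 85 (0x55).
static_assert(mask_bits<alt>(8) == 85u, "alternating 8-lane mask is 0x55");
```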

Bug fixes

AVX-512VL masked store collapsed to a 6-insn scalar fallback. An over-specified template arg list at the public batch::store(mask) call site pushed a type into a non-type pack, SFINAE'ing every per-arch overload away. The store now reaches the EVEX vmov*{k} intrinsic.
CI linux.yml typo silently dropped CXXFLAGS in the VL_128 matrix row ($CXX_FLAGS vs $CXXFLAGS), so the VL_128-default test job was building with stock flags instead of the requested override.
avx512vl register-traits comment misattributed to AVX512DQ.
Half-confined masked op fell back to scalar on AVX/AVX2/AVX-512F. The half-fold hardcoded sse4_2/avx2 as the half-width target, but two-phase lookup made the better-arch overload invisible at template-definition time — so the recursive call dispatched to the wrong arch. Fixed by include reorder + make_sized_batch_t<T, half>.
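
The first item can be illustrated in isolation. The sketch below is a simplified model, not the xsimd source: common_arch, vl_arch, mask_constant and both store_masked overloads are hypothetical stand-ins. It assumes the per-arch overload declares its template parameters as <class A, bool... Values, class Mode> while the generic fallback also carries a class T, which is the shape the description above implies. Passing an explicit type as the second template argument then lands on the bool pack of the per-arch overload, deduction fails for that candidate, and only the fallback stays viable.

```cpp
#include <cstdint>

struct common_arch {};           // hypothetical generic tag
struct vl_arch : common_arch {}; // hypothetical per-arch tag
struct unaligned {};

template <class T, class A, bool... Values>
struct mask_constant {};

// Per-arch overload: the second template parameter is a non-type pack.
template <class A, bool... Values, class Mode>
int store_masked(std::uint32_t*, mask_constant<std::uint32_t, A, Values...>, Mode, vl_arch)
{
    return 1; // stands for the single masked instruction
}

// Generic fallback: the second template parameter is a type.
template <class A, class T, bool... Values, class Mode>
int store_masked(T*, mask_constant<T, A, Values...>, Mode, common_arch)
{
    return 0; // stands for the per-lane scalar path
}

inline int over_specified(std::uint32_t* p)
{
    mask_constant<std::uint32_t, vl_arch, true, false> m;
    // uint32_t cannot bind to `bool... Values`, so the per-arch
    // overload is silently discarded and the fallback is chosen.
    return store_masked<vl_arch, std::uint32_t>(p, m, unaligned{}, vl_arch{});
}

inline int deduced(std::uint32_t* p)
{
    mask_constant<std::uint32_t, vl_arch, true, false> m;
    // Only A is explicit; both overloads are viable and the exact
    // match on the arch tag selects the per-arch one.
    return store_masked<vl_arch>(p, m, unaligned{}, vl_arch{});
}
```

Under these assumptions over_specified returns 0 and deduced returns 1, with no diagnostic in either case, which is why a regression of this kind only shows up in the generated assembly.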

Features added

Per-type load_masked / store_masked for avx512vl_128 and avx512vl_256 covering i32/u32/i64/u64/f32/f64 in both aligned and unaligned modes; partial ordering picks them over the avx2 bridges these archs inherit.
Compile-time guarantee that default_arch matches XSIMD_DEFAULT_ARCH when the macro is set — a static_assert in test_arch.cpp (plus the CMake plumbing to forward the macro) catches default-arch wiring regressions at compile time instead of at runtime.

Cleanups

AVX detail::maskstore helpers now take batch<> types, symmetric with detail::maskload.
Half-store sites and the common select use a typed variable (const batch<T, A> lo = …) instead of const auto lo = batch<T, A>{ … }.
Aligned/unaligned dispatch in the VL overloads uses XSIMD_IF_CONSTEXPR, so the inactive intrinsic isn't instantiated.
Half-fold in xsimd_avx.hpp / xsimd_avx2.hpp / xsimd_avx512f.hpp uses make_sized_batch_t<T, half> instead of a hardcoded arch — picks avx_128 / avx2_128 / avx512vl_256 when available, so half-confined stores land on the EVEX or VEX masked intrinsic.
xsimd_isa.hpp include order: _128 siblings before their wider arch, VL before avx512f.hpp. Required so the recursive store_masked<half_arch> call sees the better-arch overload at template-definition time.
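
The include-order requirement follows from how a qualified call inside a template is looked up. The sketch below is a minimal model under the assumption that the recursive call is namespace-qualified, so argument-dependent lookup does not apply; the namespaces late and early and the tags half_generic and half_better are hypothetical and only mimic declaration order, not the real headers.

```cpp
namespace late
{
    struct half_generic {};
    struct half_better : half_generic {};

    inline int store_masked(half_generic) { return 0; } // generic half-width path

    template <class HalfArch>
    int half_fold(HalfArch arch)
    {
        // Qualified call: only overloads declared above this point
        // are candidates, whatever gets declared later.
        return late::store_masked(arch);
    }

    inline int store_masked(half_better) { return 1; } // declared too late to be seen
}

namespace early
{
    struct half_generic {};
    struct half_better : half_generic {};

    inline int store_masked(half_generic) { return 0; }
    inline int store_masked(half_better) { return 1; } // visible at definition time

    template <class HalfArch>
    int half_fold(HalfArch arch)
    {
        return early::store_masked(arch);
    }
}
```

In this model late::half_fold(late::half_better{}) takes the generic path through a derived-to-base conversion, while early::half_fold(early::half_better{}) reaches the better overload. Reordering the declarations is the only difference between the two namespaces.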

(Changelog by @claude)

serge-sans-paille · 2026-05-23T06:34:21Z

        if [[ '${{ matrix.sys.flags }}' == 'avx512vl_128' ]]; then
          CMAKE_EXTRA_ARGS="$CMAKE_EXTRA_ARGS -DTARGET_ARCH=skylake-avx512"
-          CXXFLAGS="$CXX_FLAGS -DXSIMD_DEFAULT_ARCH=avx512vl_128"
+          CXXFLAGS="$CXXFLAGS -DXSIMD_DEFAULT_ARCH=avx512vl_128"


oopsie. Thanks for fixing this one.

serge-sans-paille

I wish we could keep each architecture file unaware of other architectures. I do understand this will disappear once we move to a C++17-based architecture, but I'd be happier if you could find another way to apply the constraints.

serge-sans-paille · 2026-05-23T06:35:42Z

+        // and the VL native as equally specialized for A=avx512vl_*. (bridge_not_vl in fwd.hpp)
        template <class A, bool... Values, class Mode>
-        XSIMD_INLINE batch<int32_t, A> load_masked(int32_t const* mem, batch_bool_constant<int32_t, A, Values...>, convert<int32_t>, Mode, requires_arch<A>) noexcept
+        XSIMD_INLINE std::enable_if_t<bridge_not_vl<A>::value, batch<int32_t, A>>


there should not be any arch-specific code in xsimd_common_memory.hpp. Could you find another approach?

serge-sans-paille · 2026-05-23T06:38:09Z


            template <class A>
-            XSIMD_INLINE void maskstore(double* mem, batch_bool<double, A> const& mask, batch<double, A> const& src) noexcept
+            XSIMD_INLINE void maskstore(double* mem, batch<as_integer_t<double>, A> const& mask, batch<double, A> const& src) noexcept


Why that change? In my mental model, the masked store takes a bool mask, not an integer mask.

_mm256_maskstore_ps takes a __m256i for its mask. The bool mask available in the function calling this utility is backed by a floating-point type.

serge-sans-paille · 2026-05-23T06:39:00Z

        // single templated implementation for integer masked loads (32/64-bit)
-        template <class A, class T, bool... Values, class Mode>
+        template <class A, class T, bool... Values, class Mode,
+                  class = std::enable_if_t<std::is_base_of<avx2, A>::value && !std::is_base_of<avx512vl_256, A>::value>>


I quite dislike the fact that xsimd_avx2.hpp needs to know stuff about avx512vl

I tried, I'll have another look. Without this constraint all compilers work fine except gcc-10 :(

serge-sans-paille · 2026-05-23T06:40:23Z

        }
-        template <class A, bool... Values, class Mode>
+        template <class A, bool... Values, class Mode,
+                  class = std::enable_if_t<std::is_base_of<avx_128, A>::value && !std::is_base_of<avx512vl_128, A>::value>>


Same cross-architecture reference issue here.

serge-sans-paille · 2026-05-23T06:40:46Z

 #if XSIMD_WITH_AVX
-#include "./xsimd_avx.hpp"
+// clang-format off
+// _128 first: avx half-fold recursive call needs avx_128 visible at parse time.


nice catch.

serge-sans-paille · 2026-05-23T06:41:02Z

     * @ingroup architectures
     *
-     * AVX512DQ instructions
+     * AVX512VL instructions


serge-sans-paille · 2026-05-23T06:41:22Z

+    template <typename T, std::size_t N>
+    struct make_sized_batch;
+    template <typename T, std::size_t N>
+    using make_sized_batch_t = typename make_sized_batch<T, N>::type;


serge-sans-paille · 2026-05-23T06:43:04Z

    message(STATUS "Using emulated target: ${TARGET_EMULATED}")
    set(EMULATED_COMPILE_FLAGS -DXSIMD_DEFAULT_ARCH=${TARGET_ARCH};-DXSIMD_WITH_EMULATED=1)
    unset(TARGET_ARCH CACHE)
+elseif (DEFINED XSIMD_DEFAULT_ARCH AND NOT "${XSIMD_DEFAULT_ARCH}" STREQUAL "")


Per https://cmake.org/cmake/help/latest/command/if.html#constant I think

if(XSIMD_DEFAULT_ARCH)

is enough

(confirmed by https://godbolt.org/z/b64GcKaxM)

serge-sans-paille · 2026-05-23T06:46:04Z

 static_assert(xsimd::all_architectures::contains<xsimd::default_arch>(), "default arch is a valid arch");
+#else
+namespace xsimd
+{


AntoinePrv · 2026-05-25T16:21:35Z

@DiamonDinoia I ran into similar issues with load_masked in #1348, where I also fixed and improved the linux.yml CI.

I haven't looked at the assembly (I was just making it build with the updated CI settings). The C++ diff should be fairly small; I wonder if you'd be able to get anything useful from it. I was able to simplify the functions from common_memory much further while keeping them lower priority (I believe). However, I think the case where common memory casts to float and calls back into the original arch is a big footgun. It would be better IMHO to make that call explicitly everywhere (possibly with a second, different utility function).

That being said, my solution currently stalls on avx512vl_128/avx512vl_256...

Let CMake force a specific default arch via -DXSIMD_DEFAULT_ARCH (idiomatic if(XSIMD_DEFAULT_ARCH) guard), add a test_arch.cpp check that the forced arch is the default, and fix the linux.yml CXXFLAGS typo.

Split the avx_128 variable swizzle into explicit float/double overloads with a width static_assert, and fix an AVX512DQ -> AVX512VL doc comment.

Add the missing int64/uint64/float/double load_masked overloads and correct the store_masked batch_bool_constant typing on avx512vl_128 and avx512vl_256, branching aligned vs unaligned to the right EVEX intrinsic (vmovdqu{32,64}{k}{z} / vmov{a,u}p{s,d}{k}{z}); unsigned overloads delegate via bitwise_cast. Resolve the avx/avx2/avx512f half-fold target through make_sized_batch_t<T, half>::arch_type so a 512-bit masked op picks the VL arch and emits EVEX instead of VEX vpmaskmov*/vmaskmov*.

Drop the cross-arch SFINAE/tag mechanism: a concrete requires_arch<avx512vl_128|256> overload now beats the inherited avx2/avx2_128 one by overload conversion ranking, so no arch file knows about another. xsimd_common_memory.hpp keeps only requires_arch<common> and dispatches on the arch-agnostic trait masked_memory_uses_fp_bitcast (integral with a same-width float register -> reuse that float vmaskmov* path, else a scalar buffer). avx/avx2/avx2_128 drop every is_base_of<avx512vl_*, A> guard; avx2_128 routes native 128-bit integer masked memory through vpmaskmov* (long long* cast for 64-bit) and tags int64/uint64 on avx2_128 (those intrinsics need AVX2). detail::maskstore takes a bool mask and casts internally; xsimd_batch.hpp keeps a make_sized_batch fwd-decl and simplifies the store_masked call; xsimd_isa.hpp documents the _128-first include order; sse2.hpp adapts to the new store_masked(common) signature.
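
The conversion-ranking argument in this commit message can be checked with a small model. Everything below is a hypothetical stand-in (avx2_like, vl256_like, requires_arch_like, load_masked, dispatch); it assumes, as the message describes, that the VL tag derives from the AVX2 tag and that the arch parameter is taken by const reference.

```cpp
#include <cstdint>

struct avx2_like {};              // hypothetical base arch tag
struct vl256_like : avx2_like {}; // hypothetical derived arch tag

template <class A>
using requires_arch_like = A const&;

// Inherited overload: accepts any arch derived from avx2_like.
template <class A>
int load_masked(const std::int32_t*, requires_arch_like<avx2_like>)
{
    return 2;
}

// Concrete overload for the derived arch; the base overload needs no guard.
template <class A>
int load_masked(const std::int32_t*, requires_arch_like<vl256_like>)
{
    return 3;
}

template <class A>
int dispatch(const std::int32_t* p)
{
    // For A = vl256_like both overloads are viable. Binding to
    // vl256_like const& is an identity conversion, binding to
    // avx2_like const& is a derived-to-base conversion, so the
    // concrete overload is the better match.
    return load_masked<A>(p, A{});
}
```

With these definitions dispatch<vl256_like> selects the concrete overload and dispatch<avx2_like> still selects the inherited one, without either overload naming the other architecture in an enable_if condition.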

DiamonDinoia · 2026-06-01T18:25:51Z

@serge-sans-paille ready for a second round of review. It is probably over-commented because I chatted with @claude about how best to do this in C++14 (no if constexpr...). My solutions were so SFINAE-heavy that I asked multiple times how to simplify this, and the final outcome is this one. I left the comments in to explain a bit more; happy to trim/clean them after a second round of review.

DiamonDinoia force-pushed the fix/avx512vl-masked-memory branch 5 times, most recently from fe2938e to ea882e6 Compare May 21, 2026 12:04

DiamonDinoia marked this pull request as ready for review May 21, 2026 12:51

DiamonDinoia requested a review from serge-sans-paille May 21, 2026 12:52

serge-sans-paille reviewed May 23, 2026

View reviewed changes

serge-sans-paille requested changes May 23, 2026

View reviewed changes

DiamonDinoia added 3 commits June 1, 2026 11:51

ci: support XSIMD_DEFAULT_ARCH override and verify default_arch

108a99b

chore: small drive-by fixes (avx_128 swizzle, doc typo)

9ebcf0f

DiamonDinoia force-pushed the fix/avx512vl-masked-memory branch from ea882e6 to fa06792 Compare June 1, 2026 15:56

DiamonDinoia force-pushed the fix/avx512vl-masked-memory branch from fa06792 to 5a40538 Compare June 1, 2026 16:53
